knitr::opts_chunk$set(fig.width=15, fig.height=10)
Prosper Marketplace, Inc. is a San Francisco, California-based company in the peer-to-peer lending industry.In this project, I use their dataset to explore their data which contains 113,937 loans with 81 variables.
Since the dataset is too large and has around 81 variables,I’m subsetting my dataset to consider only those variables which I think will give me a good detailed explanation of my borrowers!
## 'data.frame': 113937 obs. of 24 variables:
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric.: int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
After eye-balling the dataset, I saw that there were some variables which required reverse coding, numeric-to-factor conversion and other slight changes.
## Warning: package 'bindrcpp' was built under R version 3.3.3
##
## Defaulted Completed Current Late Payments
## 17010 38279 56576 2067
0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
##
## Debt Consolidation Home Improvement Business
## 58308 7433 7189
## Personal Loan Student Use Auto
## 2395 756 2572
## Baby & Adoption Boat Cosmetic Procedure
## 199 85 91
## Engagement Ring Green Loans Household Expenses
## 217 59 1996
## Large Purchases Medical/Dental Motorcycle
## 876 1522 304
## RV Taxes Vacation
## 52 885 768
## Wedding Other Not Applicable
## 771 10494 16965
4.From the variable dictionary, the two columns- CreditGrade & ProsperRating (Alpha) give us thhe credit rating for loans pre-2009 and post 2009 respectively.
Joining these two columns into a single column called CreditRating.
Since our dataset is about the loans, lets see how much loan amount is usually requested.
Plotting the histogram for LoanOriginationAmount.
From the above plot we see that the loan amounts 4000, 10000, 15000 and 20000 have count value more than 10000. This means that the bulk of loaners from Proper usually borrow smaller amounts. Lets, see why? Is it cause of the high Interest rate or based on the borrower characteristics?
Lets see how long are these loans usually taken for. It’s expressed in months in the dataset, converting months to year by dividing the term value by 12.
From the above plot, we can clearly say that most of the loans are lent on 36 months (= 3 years) or 60 months (= 5 years) basis. Very few are taken on 12 month basis
Lets see what’s the status of most loans?
So we see the certain loans have been defaulted and there are a lot of bands which show that they are quite some late payments.
## [1] 4.404188
## [1] 1.81416
The exact percentage of cases that are defaulted is 4.401% and number of late payments are 1.81%
Let’s explore what is the major category under which Prosper loans are issued
From above plot, its clear that most of the loans are issued under Debt Consolidation and the rest ambigious ones (Not Applicable) comes next but is much below the 20,000 count mark.
Let’s explore the data to understand more about our borrowers.
Looks like most of the borrowers are from Arizona State!
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Looking at the above plot, it looks like most of the loans are borrowed at the rate of 0.15 % -0.17% The next largest bin is 0.1% followed by 0.2% And looking at the whole graph, it looks normally distributed!
Looks like most of the borrowers belong to low to mid income range. The largest bin has users of income range $25K-50K followed by $50k-75k
## Warning: Removed 5584 rows containing non-finite values (stat_bin).
So from the above plot its clear that most of the borrowers have a credit score of 650-750.
After creating new variables from the existing dataset, my final dataset of 26 variables consists of 8 factor variables and rest integer variables.I have chosen the following variables to explore the characteristics of Prosper borrowers and the company as such. The most interesting plot which i found from my univariate analysis, is the time series plot where we saw how many loans were issued per quarter since 2006 to 2014-Q1. The second most interesting factor about the borrowers of Prosper is that the companies borrowing are all mid-level companies issueing loans at the rate of 0.15-0.17% mainly for Debt Consolidation purposes. Another interesting characteristics of these creditors is that they all have an average credit score between 650-750 and are from the state of Arizona.
Lets explore and see if there is any relation between the Credit Rating and Credit Score. Just to re-iterate, Credit Rating is a variable which i created by clubbing the columns - Credit Grade and ProperRating..Alpha
## Warning: Removed 852 rows containing non-finite values (stat_boxplot).
The Credit Rating assigned at the time of the listing was created: 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. So, looking at the plot, its clear that as the rating given my prosper is directly propostional to the credit score of the user. But though rating ‘AA’ has higher rating ‘A’ but the median score of AA looks lower than A. Let’s see what summary statistics tell us about this:
by(loan$CreditScore, loan$CreditRating, summary)
## loan$CreditRating: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 649.5 709.5 729.5 739.2 769.5 889.5 73
## --------------------------------------------------------
## loan$CreditRating: A
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 689.5 769.5 789.5 787.7 809.5 889.5 92
## --------------------------------------------------------
## loan$CreditRating: B
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 609.5 689.5 709.5 712.4 729.5 869.5 77
## --------------------------------------------------------
## loan$CreditRating: C
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 609.5 669.5 689.5 689.7 709.5 889.5 98
## --------------------------------------------------------
## loan$CreditRating: D
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 609.5 629.5 669.5 671.3 709.5 869.5 78
## --------------------------------------------------------
## loan$CreditRating: E
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 549.5 609.5 649.5 647.3 689.5 869.5 71
## --------------------------------------------------------
## loan$CreditRating: HR
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.5 529.5 669.5 632.4 689.5 869.5 91
From the summary statistics, we see that the median value for rating ‘AA’ is lesser than ‘A’ by 50 point and even the Mean value. Reading more about how Prosper’s properitiary system, it looks like they grade loaners AA not just by their credit history but also internal scoring system called Prosper Score and only those are given AA if they fall under low risk category and get lower interest rate (5.48%)
Now exploring Amount borrowed over time by Term
Though the plot isn’t very clear, but it looks like most of the large amount of money has been borrowed for long term usually for 60 months which is a 5 year period.
Lets plot Listing category and the loan Amount
Looks like highest loan amounts are borrowed under ‘Debt Consolidation’ and ‘Baby&Adoption’ category.Lets look at what rate at which these loans are issued
ggplot(aes(x=ListingCategory,y=BorrowerRate), data = loan) +
geom_boxplot() +
ggtitle('Rate of Interest for various Categories') +
xlab('Listing Categories') +
ylab('Borrower Rate')+theme(axis.text.x = element_text(angle = 45, hjust = 1))
Though the amount borrowed under Debt Consolidation and Baby Adoption is high but the borrower rate is pretty low. The highest rate is for Cosmetic Procedures and Household expenses.
Looking at Income range vs. Occupation
Looks like Nurses have taken a good share of loans and they seem to fall under the less risky category. But at the same time, they seem to hold good share every credit rating grade.
Now, lets see how the loan status varies w.r.t Loan Origination Amount
ggplot(aes(x=loan$newStatus, y=loan$LoanOriginalAmount,fill=loan$newStatus), data = loan )+geom_boxplot(lwd = 0.75, outlier.color = I("#FBB448"), outlier.shape = 1) + scale_x_discrete(limits = c("Defaulted", "Completed", "Current", "Late Payments"))
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
# Bivariate Analysis
After looking at all the plots in the Bivariate analysis, it looks like the Nurses are more in number of loans issued. They seem equally distributed in all the ratings provided by Prosper.This relates to why the lower income range [25000-49000] have issued the most number of loans. The most amount of loans where taken under the category of Debt COnsolidation and Baby & Adoption but the rate of interest was highest for the categories Household Expenses and Cosmetic Procedures.
The last plot where I explore the loan amount with the loan status. I derived the column column newStatus from the original column Loan Status, where I grouped statuses - Charged Off and Defaulted into Defaulted, Final Payment in Progress into Completed, all the late payments with various day deltas into Late Payments category.I removed the cancelled category and converetd them into NA.
The total dstribution for this new column is as follows
##
## Defaulted Completed Current Late Payments
## 17010 38279 56576 2067
## Warning in ggcorr(loan, label = TRUE, label_size = 3, hjust = 0.8,
## size = 2.5, : data in column(s) 'CreditGrade', 'ListingCreationDate',
## 'LoanStatus', 'ClosedDate', 'ProsperRating..Alpha.', 'BorrowerState',
## 'Occupation', 'IncomeRange', 'LoanOriginationQuarter',
## 'LoanOriginationDate', 'newStatus', 'ListingCategory', 'CreditRating',
## 'Term.f' are not numeric and were ignored
The correlation matrix revealed some interesting facts. The variables with correlation of 1 between variables Lender Yield and Borrower Rate, Creation Year and Origination Year, CreitScoreRangeUpper and CreditScoreRangeLower is all because they are more or less the same variables. The highest correlation in this matrix, is between the Current Delinquencies and Delinquencies Last 7 years which is 0.4 which means that those accounts are still open. One more surprising thing I noticed was that the BorrowerRate and CreditScore have a negative correlation of 0.5 and ProsperScore w.r.t LenderYield has a negative correlation of 0.6, this is unexpected since the score usually determines the risk factor which indirectly determines the rate of interest at which the loan is issued.
## Warning: Removed 3525 rows containing missing values (geom_point).
LenderYield w.r.t DebtIncomeRatio
## Warning: Removed 9090 rows containing missing values (geom_point).
The above two plots of DebttoIncomeRatio vs. Credit Score and LederYield respectively. The first plot shows that there is overplotting near the lower end of DebttoIncomeRatio, the defaulters tend to have low credit score. There is a line of plots at the higher end of DebttoIncome ratio which are mostly the ones who have completely paid back their loans. The second plot with Lender Yield, also has overplotting at the lower end of Debt-Income ratio. Most of the low risk prosper grades have a very low interest rate compared to the ones at higher risk.
Time series plot help us to better visualize the performance of Prosper, we can see the different trend lines, ups and downs of the company.I’m creating a separate dataset from our main dataframe Loan, with the Quarter variable and the original loan amount.
We see that the quarter4 in the year 2008, the number of loans taken gradually decreases and there are no loans issued in Quarter 1 of the year 2009, Q2-2009 there are only 10 loans issued and then it slowly increases. This pitfall is because of the Great Depression of 2008. The last quarter of 2013, Propser issued the maximum number of loans of 144500, followed by the year 2014, we have only the first quarter information and during that time, it seems like Prosper issued 121700 loans. This plot allows us to see how Prosper was performing since it has started, though we have the information only until its first quarter of 2014
I believe it doesn’t give us enough details when using listing category variable for trend or relationship analysis and the category is probably better left ignored during those types of investigations. But one thing we can take away from the high volume of borrowers requesting debt consolidation loans, is that, many borrowers have current debt and ver few percent have defaulted and is by far the most frequent category.
## Warning: Removed 272 rows containing missing values (geom_point).
I believe from this plot, its clear that most of the low risk loans have been issued for longer period from 36 to 60 months. The loans that were issued for a year or less is more in the higher risk borrowers.
In this analysis, my major hurdle was reading the entire csv due to its huge volume of 114k observations and 81 variables. My first thought to reduce the runtime was to subset the dataset, with my initial understanding of the variables from the Variable Description file, I started off with 15 variables. But as i progressed into my analysis, I realized that these 15 were not enough to get the results I was looking for. This re-iteration of going back to step 1 and re-selecting the variables went on for 3-4 times!Second biggest hurdle, was trying to figure out how to get the ratings for the loans issued before 2009.Since we had a good amount of data from 2007 quarter-1. I finally figured out a way to club the ProperRating..Alpha with the Credit Grade provided, to create a new column called Credit Rating. I also wanted to use the borrower state variable to generate a R-Map to see the different counts of defaulters, loanStatus per state. But do to some discrepency with the Mappers library or due to my mistake, I was unable to do so!
There are a number of different ways to take this project further. Firstly, I’ve focused only on a small subset of the variables available in the dataset, and there is a vast amount of data I’vent explored. I think if I better understood the finance/loan terms I would do better justice to this data analysis.I think I’d like to explore the investors side a bit more; look at investor profit and losses. Also, I would like to learn about the kind of plots and graphs specifically used by the finance industry, so that I can incorporate that knowledge into any future datasets I may explore, like using density line graphs showing spikes by time period.